library(HardyWeinberg)
library(TeachingPopGen)

Introducing Fst

Having spent some time thinking about the history of alleles, we now want to turn our attention back to genotypes and their distributions. Of course, all of our considerations in this context have been in terms of departure from Hardy-Weinberg expectations, which we can test for significance by a χ2 test or related procedure.

But let’s think about it in another way. Remember that, for a biallelic locus, we have one parameter, p, that we estimate from the data; q is simply 1-p, and from those we calculate the expected frequency of heterozygotes, or heterozygosity, as 2pq. So any departure from Hardy-Weinberg will be manifest as a difference between observed and expected heterozygosity. Thus, let’s define

  • F = 1 - Hobs/Hexp (Equation 1)

Where Hobs is the observed frequency of heterozygotes in our population and Hexp = 2pq. This relationship can also be written as

  • F = (Hexp - Hobs)/Hexp (Equation 2)

Note that if Hobs = 2pq, then F = 1-1 = 0; if Hobs < Hexp, then F > 0, and vice versa.
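The relationship F = 1 - Hobs/Hexp can be sketched as a small helper function taking a vector of genotype counts c(AA, AB, BB); f.simple is a hypothetical helper for illustration, not a function from either package.

```r
# A minimal sketch of F = 1 - Hobs/Hexp from genotype counts c(AA, AB, BB).
# f.simple is a hypothetical helper, not part of HardyWeinberg or TeachingPopGen.
f.simple <- function(counts) {
  p <- (2*counts[1] + counts[2])/(2*sum(counts))   # frequency of the A allele
  Hexp <- 2*p*(1 - p)                              # expected heterozygosity
  Hobs <- counts[2]/sum(counts)                    # observed heterozygosity
  unname(1 - Hobs/Hexp)
}
f.simple(c(25, 50, 25))  # exact HW proportions, so F = 0
f.simple(c(40, 20, 40))  # heterozygote deficit, so F > 0
```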

The simple case - one population

We can start with one population, of size 100. It has some allele frequency p. We’ll generate some data as follows:

set.seed(1123)
pop1 <-HWData(nm=1,n=100,f=runif(1,0,1), p=runif(1,.05,.5))# generate data for a random locus with some p between .05 and .5 and some f between 0 and 1
pop1
## AA AB BB 
## 27 10 63

Now we want to calculate the expected heterozygosity (2pq). This is fairly straightforward; we can do it as follows

p <-(2*pop1[1]+pop1[2])/(2*sum(pop1))
Hexp <-2*p*(1-p)
Hexp
##     AA 
## 0.4352

And of course, Hobs is the observed frequency of AB, the second element in pop1:

Hobs <-pop1[2]/sum(pop1)
Hobs
##  AB 
## 0.1
F.pop1 <-1-(Hobs/Hexp)
F.pop1
##        AB 
## 0.7702206

So if this were all we had, we would conclude that observed heterozygosity is less than expected, suggesting, for example, some form of assortative mating or inbreeding, leading to the observed reduction.

Adding Population Subdivision - the Wahlund effect

Now consider the following - we have two populations with the following allele frequencies

pop1 <-c(.64,.32,.04)
pop2 <-c(.09,.42,.49)

We can easily see that both are in Hardy-Weinberg equilibrium, so in the context of our current consideration

  • F(pop1) = F(pop2) = 0

But then we asked what would happen if a single sample were considered, with 2/3 of the individuals coming from pop1 and 1/3 from pop2:

popt <-(2/3)*(pop1)+(1/3)*(pop2)
popt
## [1] 0.4566667 0.3533333 0.1900000

We can now look at the observed and expected heterozygosity in the combined sample, using a function in the TeachingPopGen package.

f.comb <-Fis(popt)
f.comb
## [1] 0.2392344

And we see that again, observed heterozygosity is lower than expected.

This is an example of something known as the Wahlund Effect. In simple terms, it states that if a sample is derived from multiple populations whose allele frequencies differ, then there will be an apparent deficit of heterozgyotes relative to those expected based on the overall allele frequency, even when the individual populations are themselves in Hardy-Weinberg Equilibrium.
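The Fis() call above is a TeachingPopGen convenience; the same number falls out of first principles by pooling the genotype frequencies and applying F = 1 - Hobs/Hexp directly:

```r
# Recomputing f.comb without the TeachingPopGen Fis() function:
# pool the genotype frequencies, then apply F = 1 - Hobs/Hexp.
pop1 <- c(.64, .32, .04)
pop2 <- c(.09, .42, .49)
popt <- (2/3)*pop1 + (1/3)*pop2      # pooled genotype frequencies
p <- popt[1] + popt[2]/2             # pooled allele frequency
Hexp <- 2*p*(1 - p)                  # expected heterozygosity
Hobs <- popt[2]                      # observed heterozygosity
1 - Hobs/Hexp                        # 0.2392344, matching f.comb above
```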

So, suppose we’ve surveyed a sample of individuals, and we find some nonzero value of F. How do we interpret it? Is it due to within-population departure from HW, among subpopulation variance in allele frequency, or both? In what follows, we will see how we can separate these two factors, and in so doing introduce ourselves to one of Sewall Wright’s greatest contributions to theoretical population genetics.

Some hypothetical data:

In the preceding discussion, we showed two ways , given a single sample of genotypes, that an apparent heterozygote deficit could result. Now, let’s suppose we have actually collected 5 separate samples, perhaps from different locations, and we want to assess these two sources of departure from Hardy Weinberg. To keep the arithmetic simple, we’ll have equal size samples from each subpopulation (100).

Generate the data

We’ll use the package HardyWeinberg to generate our data, allowing for both variation in allele frequency (p) and within-subpopulation departure from expected heterozygosity (f):

dat <-(sapply(c(1:5),function(x) (HWData(nm=1,n=100,f=runif(1,.05,.2),p=runif(1,.1,.9)))))
rownames(dat) <-c("AA","AB","BB")
dat
##    [,1] [,2] [,3] [,4] [,5]
## AA    8   31   21   32   62
## AB   43   41   45   48   29
## BB   49   28   34   20    9

Departures from HW within Populations

First, we can simply ask qualitatively whether heterozygosity in each of the five populations differs from expected:

f1 <-apply(dat,2,Fis,freq=FALSE)
f1
## [1] -0.03377810  0.17926134  0.08452853  0.02597403  0.19343624

And we see that we have varying degrees of departure from expectation.

Restructuring the Data

Think back on what we’ve said so far about within-population departures from expectations and those resulting from the Wahlund effect. In the case of the former, the critical parameter is observed heterozygosity; in the latter, the critical elements are the allele frequencies of the subpopulations. So let’s create a matrix that includes those numbers - p, q, and Hobs - for each population

dat <-apply(dat,2,function(x) x/sum(x))
p <-apply(dat,2,function(x) x[1]+.5*x[2])
dat.2 <-rbind(p,1-p,dat[2,])
rownames(dat.2) <-c("p","q","Hobs")
dat.2
##       [,1]  [,2]  [,3] [,4]  [,5]
## p    0.295 0.515 0.435 0.56 0.765
## q    0.705 0.485 0.565 0.44 0.235
## Hobs 0.430 0.410 0.450 0.48 0.290

Calculating Heterozygosities

Now we want to calculate two measures of heterozygosity:

  • Hi - this is simply the mean of the five subpopulation heterozygosities - (∑Hobs)/n
  • Hs - This is the mean of the expected heterozygosities of the 5 populations - (∑2pq)/n

Where n is the number of populations (5 in this case)

Hi <-mean(dat.2[3,])
Hs <-mean(2*dat.2[1,]*dat.2[2,])
Hi; Hs
## [1] 0.412
## [1] 0.45188

Fis

So we now have a mean of the observed heterozygosity (Hi) and one of the expected (Hs). Referring back to Equation 2, we can now define

  • Fis = (Hs-Hi)/Hs

Which is easily calculated

f.is <-(Hs-Hi)/Hs
f.is
## [1] 0.08825352

So we now have a statistic that quantifies the mean departure of heterozygosity from expectation within subpopulations. Keep this in mind; we now need to quantify the contribution of the Wahlund effect.

What About Subdivision?

In the previous calculations, we came up with Hs, the average expected heterozygosity of the five subpopulations. What we now want to ask is what the expected heterozygosity would be if in fact the samples were derived from a single HW population, with one value of p. To get that, we define one more heterozygosity statistic:

  • Ht =2 mean(pi) mean(qi)

This can be easily calculated as follows:

Ht <- 2*mean(dat.2[1,])*mean(dat.2[2,])
Ht
## [1] 0.499608

Fst

So once again, we have a comparison we can make - the expected heterozygosity for a single population with p=mean(p) and the heterozygosity expected based on the observed allele frequencies of the populations. We can therefore apply Equation 2 again as follows

Fst <-(Ht-Hs)/Ht
Fst
## [1] 0.0955309

This is the critical measure. It is, as we shall see shortly, a function of the variance in allele frequencies among populations. It is fairly easy to show that if var(p) = 0, then Fst = 0, but more on that later.
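A quick numerical check of that claim: if every subpopulation has the same allele frequency (hypothetical values below), then Hs equals Ht and Fst is exactly 0.

```r
# If all subpopulations share the same p, var(p) = 0 and Fst = 0.
p <- rep(0.3, 5)                     # hypothetical, identical frequencies
Hs <- mean(2*p*(1 - p))              # mean expected heterozygosity
Ht <- 2*mean(p)*(1 - mean(p))        # heterozygosity at the mean frequency
(Ht - Hs)/Ht                         # 0
```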

For completion’s sake, Fit

We now have three measures of heterozygosity:

Hi; Hs; Ht
## [1] 0.412
## [1] 0.45188
## [1] 0.499608

And we have calculated two F statistics:

f.is;Fst
## [1] 0.08825352
## [1] 0.0955309

We could also ask what the mean reduction of heterozygosity within populations is relative to the expected total heterozygosity:

Fit <-(Ht-Hi)/Ht
 Fit
## [1] 0.1753535

Quite honestly, this may be the last time we will ever see this statistic. Fis and Fst are much more meaningful biologically.

Fst and Allele frequency variance.

So far, we have alluded to the idea that Fst is a function of allele frequency variance. It is fairly easy to formalize that relationship. Let’s start with the difference between the observed and expected frequency of homozygotes for one allele:

  • freq(aa)obs - freq(aa)exp = ∑qi²/n - mean(qi)² = σq²

Similar logic can be applied to freq(AA), so the total excess of homozygosity resulting from subdivision is 2σq², which in turn equals the negative of Hobs - Hexp. So we have

  • -(Hs - Ht) = Ht - Hs = 2σq²

and

  • Fst = (Ht - Hs)/Ht = 2σq²/(2pq)
  • = σq²/pq

Where p and q are the means of the subpopulation allele frequencies.
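The first step of the derivation is easy to check numerically: for any vector of subpopulation frequencies (hypothetical values below), the mean of qi² minus the squared mean of qi is exactly the population (divide-by-n) variance of q.

```r
# Numerical check: mean(q^2) - mean(q)^2 equals the population variance of q.
q <- c(0.2, 0.4, 0.9)                # hypothetical subpopulation frequencies
n <- length(q)
lhs <- mean(q^2) - mean(q)^2         # excess homozygote frequency
rhs <- var(q)*(n - 1)/n              # population (vs. sample) variance
all.equal(lhs, rhs)                  # TRUE
```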

We can show this for our data we have been analyzing as follows

sigma <-var(dat.2[1,])*((5-1)/5) #second term is correction to get population vs. sample variance
Fst.var <-sigma/(mean(dat.2[1,])*mean(dat.2[2,]))
Fst.var
## [1] 0.0955309

And we see that we get the same value as we did when we calculated Fst previously (0.0955309). This observation is experimentally very important - it means that, if what we are interested in is population subdivision, then all we need are the allele frequencies of the subpopulations, not the genotype frequencies, in order to compute Fst.
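As a sketch of that point, with a hypothetical vector of subpopulation allele frequencies, Fst computed from the frequencies alone agrees with the (Ht - Hs)/Ht form:

```r
# Fst from allele frequencies alone: var(p)/(p.bar*q.bar), using the
# population variance, agrees with (Ht - Hs)/Ht. Frequencies are hypothetical.
p <- c(0.2, 0.5, 0.8)
n <- length(p)
sigma2 <- var(p)*(n - 1)/n           # population variance of p
p.bar <- mean(p); q.bar <- 1 - p.bar
Fst.freq <- sigma2/(p.bar*q.bar)
Hs <- mean(2*p*(1 - p))              # mean expected heterozygosity
Ht <- 2*p.bar*q.bar                  # heterozygosity at the mean frequency
c(Fst.freq, (Ht - Hs)/Ht)            # both 0.24
```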

Conclusion

So what we have done, in effect, is to partition the deviation from Hardy Weinberg into components based on departure from heterozygosity within populations (resulting from some form of nonrandom mating, perhaps) and that associated with the Wahlund effect (the result of allele frequency variation among populations). The basic mathematics here are quite simple, and for most of our purposes, will suffice. You should be aware, however, that other measures of subdivision (e. g. the Weir and Cockerham statistics), which are more amenable to significance testing, are also used. However, while we may touch on a few of them in the future, at this point we need to turn our attention to the analysis of real data.

An example - the Idaho Giant Salamander

In our previous unit, we saw how we could quantify population subdivision based on variation in allele frequencies among subpopulations. However, we only considered the simplest possible case, one in which there are some number of populations, and we want to assess how well they are differentiated from one another. But consider another possibility - suppose we’d like to look at multiple levels of division - for example, among populations within a region, among regions within a continent, etc. A wonderful paper on Idaho Giant Salamanders (Mullen et al. 2010) does just that (albeit on a smaller scale). This is the creature:

IDS1

And the habitat they examined in Northern Idaho was ideal for nested Fst analysis. They looked at three river basins - the Lochsa, the St. Joe, and the St. Regis. Each of those basins consists of multiple catchments, each of which is fed by a number of streams. A schematic of a typical basin is shown below:

IDS2

(Figure 2 from Mullen et al., 2010)

More detail can be seen in their Figure 3, which shows the overall region, as well as details of the three basins:

IDS3

What Mullen et al set out to do was to ask which of the four models illustrated below best describes the population structure of these salamanders (their Figure 1).

IDS4

So which of these best fits the data?

The Analysis

To do our analyses, we will use data made available on Dryad. It consists of microsatellite data for 14 loci, scored in 15 samples from each stream in the study design. We will use the hierfstat package to do so; this package, while powerful, has a rather steep learning curve, and furthermore requires much in the way of data manipulation before it can be used constructively. Accordingly, the data have been processed appropriately and included in TPG.

To start with, we can look at the data:

data(IGS)
head(IGS)
##   Catchment  Basin Stream  1  2  3  4  5  6  7  8  9
## 1    Badger Lochsa   LBU1 45 33 46 33 56 23 23 22 44
## 2    Badger Lochsa   LBU1 55 33 34 33 55 34 13 22 44
## 3    Badger Lochsa   LBU1 55 33 44 33 55 23 12 22 44
## 4    Badger Lochsa   LBU1 55 33 45 33 56 33 13 22 44
## 5    Badger Lochsa   LBU1 55 34 46 33 56 23 33 22 44
## 6    Badger Lochsa   LBU1 55 33 34 33 55 34 33 12 34

And we see it’s a data frame, consisting of three columns (as factors) identifying basin, catchment and stream, followed by diploid genotypes for nine microsatellite loci, with each allele given a single number. We can now, with a little more manipulation, get the numbers we want.

First, we are going to convert the basin, catchment and stream IDs to numbers:

basin <-as.numeric(IGS$Basin)
catch <-as.numeric(IGS$Catchment)
stream <-as.numeric(IGS$Stream)

Now we are going to subset the data so that it just contains the genetic data (that is, we will exclude the first three columns):

loci <-data.frame(IGS[,4:12])

Now, we will do our nested F statistic analysis of the data. We will use the function varcomp.glob() to do so. The syntax may look a bit arcane, but it does make some sense. What we are telling it to do is to apply our three levels (basin, catchment, stream) in that order to the data frame containing the genetic data (loci) and do all of the calculations:

f.out <-varcomp.glob(data.frame(basin,catch,stream),loci)

And we can see the results as

kable(f.out$F)
            basin     catch    stream        Ind
Total   0.1302709 0.3616678 0.4175145  0.4035924
basin   0.0000000 0.2660563 0.3302679  0.3142605
catch   0.0000000 0.0000000 0.0874885  0.0656783
stream  0.0000000 0.0000000 0.0000000 -0.0239012

In fact, what we are most interested in are the numbers on the diagonal. So, for example, the first cell of the first column provides a measure of differentiation among basins relative to the total, the second element of the second column reflects differentiation among catchments relative to basins, and so forth. To make that clearer, those values are rearranged below:

f.diag <-diag(f.out$F)
lev <-c("Among Basins","Within Basins Among Catchments","Within Catchments among Streams","Within Streams (Fis)")
fin <-data.frame(Level=lev,F=round(f.diag,3))
kable(fin)
Level                                    F
Among Basins                         0.130
Within Basins Among Catchments       0.266
Within Catchments among Streams      0.087
Within Streams (Fis)                -0.024

Looking at these results qualitatively, we might conclude that there is substantial differentiation among basins and, within basins, among catchments, but considerably less among streams within catchments. And of course, the value of Fis suggests a probably insignificant excess of heterozygotes at the local level.

Testing for significance

We have so far been assessing our results qualitatively, and for many purposes that is sufficient. However, hierfstat does provide one test of significance, in which a test statistic calculated from the data is compared to data sets generated by randomly permuting the actual data (in this case, randomly assigning salamander genotypes to different streams). The statistic calculated, G, is a likelihood ratio, comparing the likelihood of observing the particular data (actual or randomized) to that expected based on overall allele frequencies (i. e. no subdivision).

As an example, we will ask whether the observed differentiation among streams within catchments (0.0874885) is significant. We do the test as follows:

sig.str <-test.within(loci,test=stream,within=catch,nperm=1000)

The last number in the first element of the list returned is the observed value, so we can compare it to the values from the randomized permutations. That is, we can see where our observed value lies relative to the upper and lower 2.5% of the values generated from the permuted data:

obs <-sig.str[[1]][1000]
dst <-sig.str[[1]][1:999]
obs
## [1] 7073.682
quantile(dst,c(.025,.975))
##     2.5%    97.5% 
## 6368.138 6469.537

And we see that, based on this analysis, the distribution of genotypes among streams departs significantly from that expected if they were distributed randomly among them.
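One way to summarize such a comparison is as an empirical p-value: the fraction of statistics (permuted plus observed) at least as large as the observed one. The sketch below uses synthetic stand-in numbers, not the hierfstat output above.

```r
# A sketch of turning a permutation distribution into an empirical p-value.
# The permuted values are synthetic stand-ins for dst, chosen for illustration.
set.seed(1)
perm <- rnorm(999, mean = 6400, sd = 30)   # synthetic null distribution
obs <- 7073.682                            # observed G from above
p.perm <- (sum(perm >= obs) + 1)/(length(perm) + 1)
p.perm                                     # 0.001
```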

Conclusion

So what have we accomplished so far? In the case of the Idaho Giant Salamander, differentiation among catchments is the most striking, suggesting limited or no gene flow among them, akin to the Death Valley Model. In contrast, when we look within catchments, while significant differentiation exists among streams, its lower level is consistent with more gene flow, perhaps consistent with the stream hierarchy model. But at this point, we are basing everything on an overall assessment of population structure. What we have not done is to look at particular comparisons. For example, how does differentiation among streams within one particular catchment compare with another?

Application to Human Data

In this unit we want to do two things. First, from our browser, we will take a look at ALFRED, a database of human allele frequencies maintained by Ken Kidd at Yale University. Second, we will use some of those data to introduce the idea of nested F statistic analysis, wherein we ask at what level (population and/or continent) population subdivision occurs.

Loading the Data

In a very compelling paper, (???) reported evidence showing that for a region of chromosome 21, a particular haplotype of Neanderthal origin has been the subject of recent positive selection in Tibetan populations. The evidence included geographic differentiation of allele frequencies of SNPs in the region. One such marker is rs11130248; we will use frequency data from the ALFRED database to do a bit of F statistic analysis.

First of all, we need to read the data. To do so, we are going to “scrape” it directly from the Internet, using a function in the package XML. Before we can do so, however, we need to browse to the Allele Frequency table for the data; to get there, search for the SNP on ALFRED. On the resulting page, you will note that there is a table at the top that includes a cell with the header “Alfred UID”; this number needs to be in the following code.

library(httr)
siteid <-"SI087775I" # the ALFRED UID of the SNP we are interested in
dat <-read.alfred(siteid) # Read and clean the data
head(dat)
##   Continent     Population N2    P    Q
## 3    Africa Bantu speakers 24 0.75 0.25
## 4    Africa            San 12 1.00 0.00
## 5    Africa          Biaka 62 0.90 0.10
## 6    Africa          Mbuti 30 0.93 0.07
## 7    Africa         Yoruba 48 0.71 0.29
## 8    Africa       Mandenka 48 0.85 0.15

Examining the data

As we can see, the read.alfred function finds the data we want and returns it as a dataframe, with one line per sample and columns consisting of the continent of origin, the population within that continent, 2N, and the two allele frequencies P and Q. Note that ALFRED does not contain individual data; hence we can only look at subpopulation differentiation and higher - we can’t look at within-population structure.

A quick histogram of everything

So what do these data look like? We can do a quick bar plot of allele frequencies, labeling the bars by continent, as follows:

barplot(dat$Q, names.arg=dat$Continent,las=2,main="Figure 1")

And we see, exactly as reported, that this particular allele is almost nonexistent in European populations, is variable in African ones, and is at the highest frequencies in Asia and North America (Native Americans being of Asian origin).

Aggregating the data

To make this a little cleaner (and for some subsequent analysis), we can aggregate the data by continent and replot it:

dat.cont <-aggregate.alfred(dat)
head(dat.cont)
##      Continent  N2  Nq
## 1       Africa 284  35
## 2         Asia 634  40
## 3     EastAsia 408 198
## 4       Europe 320   1
## 5 NorthAmerica 100  64
## 6      Oceania  72   8
barplot(dat.cont$Nq/dat.cont$N2,names.arg=dat.cont$Continent,las=2,main= "Figure 2")

Calculating Fst

By Continent

Note that we now have the data in two forms - by population and aggregated by continent. We can start by simply calculating Fst at the continental level

fst.c <-fst.cont(dat.cont)
fst.c
## [1] 0.3862768

Remembering that Sewall Wright suggested that values of Fst > .05 indicate some degree of population differentiation, this result suggests that further investigation is warranted.
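The exact estimator inside fst.cont() may differ in detail, but the underlying calculation can be sketched (in a simple, unweighted form) from a table of 2N and allele counts like dat.cont. The N2/Nq values below are synthetic, for illustration only.

```r
# An unweighted sketch of continent-level Fst from a dat.cont-style table.
# The counts are synthetic; fst.cont()'s exact estimator may differ.
cont <- data.frame(N2 = c(100, 200), Nq = c(10, 100))
q <- cont$Nq/cont$N2                 # per-continent allele frequency
Hs <- mean(2*q*(1 - q))              # mean expected heterozygosity
Ht <- 2*mean(q)*(1 - mean(q))        # heterozygosity at the mean frequency
(Ht - Hs)/Ht
```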

By Continent and Population

The value we have determined for Fst, 0.3862768, suggests that strong allele frequency differentiation exists among continents. But remember that we also saw (Figure 1) that those frequencies vary among populations from the same continent as well. We can address that by calculating nested F statistics as follows:

fst.n <-fst.nest(dat)
fst.n
##         Fsr       Frt      Fst
## 1 0.1701899 0.1755475 0.315861

And we can summarize the results:

Source                               Statistic      Value
Among Populations Within Continent   Fsr        0.1701899
Among Populations Relative to Total  Fst        0.3158610
Among Continents Relative to Total   Frt        0.1755475

Interpretation

By doing the nested analysis, we see that, while intercontinental differentiation is higher than we typically see with human data, comparable differences exist among populations within regions. In fact, looking back at Figure 1, this should not be surprising. Except for the European populations, we observe substantial allele frequency variation on each continent, and that is what is reflected in the Fsr statistic.

A Contrasting Case

A critical question that emerges from this sort of analysis is whether the results obtained are specific to the gene or region under consideration, or whether they in fact reflect population structure. We will look at a couple of studies on that point in a bit, but for now, we can select another SNP, rs10488710. We’ve looked up its UID number, and we can input it as follows:

siteid2 <-"SI001899B"

dat.2 <-read.alfred(siteid2)

Repeating the steps above, we can get plots both by population and by continent

barplot(dat.2$Q, names.arg=dat.2$Continent,las=2,main="Figure 1")

dat2.cont <-aggregate.alfred(dat.2)
head(dat2.cont)
##      Continent  N2  Nq
## 1       Africa 832 498
## 2         Asia 402 259
## 3     EastAsia 588 377
## 4       Europe 629 414
## 5 NorthAmerica  83  46
## 6      Oceania 120  91
barplot(dat2.cont$Nq/dat2.cont$N2,names.arg=dat2.cont$Continent,las=2)

And quite clearly, this site is much less variable than the first one we examined. We can do the nested Fst analysis on it:

dat2.nest <-fst.nest(dat.2)
dat2.nest
##           Fsr        Frt        Fst
## 1 -0.03188691 0.04874495 0.01841237

And if we restrict ourselves to the aggregated-by-continent data, we see the same:

dat2.cont.fst <-fst.cont(dat2.cont)
dat2.cont.fst
## [1] 0.01634996

And in fact we see negligible evidence for differentiation at any level. So what value should we use to characterize human population structure?

The Bigger Picture

What we will do now is to look at Fst values for 1000 random SNPs drawn from the ALFRED database. It is possible to download a table that includes those values for over 600,000 SNPs from ALFRED; however, to save time, such a sample has been included as data in TeachingPopGen. We can load it as follows:

data(HsFstSample)
#ch1 <-read.table("http://www.users.miamioh.edu/cochrabj//SitesWithFst.txt",header=TRUE)

We can then plot the distribution of Fst’s for these SNPs and compare them to those we obtained for our two previously examined SNPs:

ch1.sub <-HsFstSample

hist(ch1.sub$Fst,xlab="Fst",main="Fst - 1000 Random SNPs")
fmean <-mean(ch1.sub$Fst)
abline(v=fmean,col="red")
abline(v=fst.c, col="blue")
abline(v=dat2.cont.fst,col="darkgreen")

For the total ALFRED dataset, the mean value of Fst is 0.137622. This is actually in quite good agreement with the value of 0.12 recently reported based on a data set of 1.1 million autosomal SNPs (???). And in fact, the two SNPs we analyzed are in the opposite tails of the distribution, suggesting that neither, by itself, is a good indicator of population structure.

Conclusions

We started this unit by focusing in on one particular SNP, in a region in which multiple lines of evidence have suggested the occurrence of Neanderthal introgression followed by selection. We saw that analysis with hierarchical F statistics provides a means of quantifying the level of differentiation. We furthermore showed that, at least at the intercontinental level, this SNP is an outlier with respect to Fst, indicating that some process other than population subdivision is affecting its frequency distribution.

We also showed that the mean value for intercontinental Fst, based on 1000 SNPs, is 0.137622. It is important to understand what this means - ca. 13 percent of the variation in human populations is the result of genetic differentiation among continents. Figures like this have been reported using a variety of genetic markers - allozymes, microsatellites, SNPs and others - and have been used as evidence to argue against race (at least as correlated with continent) as a meaningful biological construct.

But there is one important point we have overlooked. In everything we have discussed, we have considered each SNP in isolation. In fact, of course, a population consists of individuals with multilocus genomes, not ones with collections of independently assorting genetic elements. So what happens if we broaden our view and address questions about population structure based on genotypes rather than genes? Stay tuned.


library(hierfstat)
library(knitr)
library(scatterplot3d)
data(IGS)

Detecting Population Structure with Principal Components Analysis

In what we have done so far, we have been able to partition genetic variation among levels - x percent within populations, y percent among populations within regions, etc. But what we haven’t been able to do is to establish relatedness among populations. So, for example, given some observed value of Fst, we do not know which populations differ from which others, or by how much. Furthermore, we are imposing our expectation of how the population should be structured on the data - wouldn’t it be nice if we could “let the data do the talking” and empirically determine appropriate groupings? This is what we will do in what follows, first with Principal Components Analysis and second with STRUCTURE, a Bayesian genotype assignment algorithm.

The essence of PCA

PCA is, in its simplest terms, a way to reduce the dimensionality of multivariate data. In a genetic context, we could have populations with allele frequencies at multiple loci or (even better) individual multilocus genotypes. How do we take such variation and make sense of it, either visually or mathematically?

We’re going to start with a simple hypothetical example, in which we look at allele frequencies for two loci in eight populations:

dat.samp <-data.frame(sample=1:8, p1=c(.32,.36,.58,.59,.78,.75,.85,.83),p2=c(.4,.55,.42,.78,.60,.90,.80,.95))
dat.samp
##   sample   p1   p2
## 1      1 0.32 0.40
## 2      2 0.36 0.55
## 3      3 0.58 0.42
## 4      4 0.59 0.78
## 5      5 0.78 0.60
## 6      6 0.75 0.90
## 7      7 0.85 0.80
## 8      8 0.83 0.95

Now, let’s make an xy plot of the data:

plot(dat.samp$p1,dat.samp$p2,ylim=c(0,1),xlim=c(0,1),xlab="p1",ylab="p2")
segments(.36,.4,.85,.88,lty=2,col="red")
segments(.57,.8,.75,.5,lty=2,col="blue")

When we look at these data, we see that the variation in them has two dimensions; however, those dimensions are not in the same orientation as the x and y axes of the graph. Rather, they are tilted, as roughly visualized by the red and blue lines. So what we can imagine (and what PCA does) is to rotate the data, and to center it on zero, such that we now have two axes that represent the variation in the data:

dat.pr <-prcomp(dat.samp[,2:3])
biplot(dat.pr,var.axes=FALSE)

summary(dat.pr)
## Importance of components:
##                           PC1    PC2
## Standard deviation     0.2758 0.1073
## Proportion of Variance 0.8685 0.1315
## Cumulative Proportion  0.8685 1.0000

This is nothing special - we’ve rescaled the data such that we can now describe them in terms of two variance components, one of which (component 1) explains 86% of the variation in the data, while the second explains 13%.
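Under the hood, PCA on these data amounts to an eigendecomposition of their covariance matrix: the squared standard deviations reported by prcomp() are exactly the eigenvalues.

```r
# PCA as eigendecomposition: prcomp()'s sdev^2 are the eigenvalues of
# the sample covariance matrix of the (centered) data.
p1 <- c(.32, .36, .58, .59, .78, .75, .85, .83)
p2 <- c(.40, .55, .42, .78, .60, .90, .80, .95)
X <- cbind(p1, p2)
pr <- prcomp(X)                      # centers the data by default
ev <- eigen(cov(X))$values
all.equal(unname(pr$sdev^2), ev)     # TRUE
```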

But what if we add a third set of frequencies to the data?

p3 <-c(.3,.4,.2,.3,.7,.8,.5,.6)
dat.samp3 <-cbind(dat.samp,p3)

Once again, we can plot the data, this time in three dimensions:

scatterplot3d(dat.samp3[,2:4])

And now, how can we make any sense of it? This is where PCA begins to show its usefulness:

samp3.pr <-prcomp(dat.samp3[,2:4])
biplot(samp3.pr,var.axes=FALSE)

summary(samp3.pr)
## Importance of components:
##                           PC1    PC2     PC3
## Standard deviation     0.3224 0.1308 0.10727
## Proportion of Variance 0.7841 0.1291 0.08679
## Cumulative Proportion  0.7841 0.9132 1.00000

Note that now, while the eight samples are relatively evenly distributed along PC1, PC2 clearly separates the samples into two groups - 1,2,5, and 6 vs. 3,4,7, and 8.

PCA is an example of what data scientists call “unsupervised learning”. If you are interested in learning more, this video is an excellent overview of the method:

Some real data

But enough of hypotheticals. Let’s see what we can do with genotype data, using the Idaho Giant Salamander data as our source:

head(IGS)
##   Catchment  Basin Stream  1  2  3  4  5  6  7  8  9
## 1    Badger Lochsa   LBU1 45 33 46 33 56 23 23 22 44
## 2    Badger Lochsa   LBU1 55 33 34 33 55 34 13 22 44
## 3    Badger Lochsa   LBU1 55 33 44 33 55 23 12 22 44
## 4    Badger Lochsa   LBU1 55 33 45 33 56 33 13 22 44
## 5    Badger Lochsa   LBU1 55 34 46 33 56 23 33 22 44
## 6    Badger Lochsa   LBU1 55 33 34 33 55 34 33 12 34

Notice that we are now working with individual genotypes, coded as 2 numbers. The important point to note is that when we do our PCAs, the location information will not be incorporated into our analysis. What we will do is to color code the points on the plots, so that we can relate the PCA results to geography:

dat.basin <-IGS[,c(2,4:12)]
basin.pca <-indpca(dat.basin)
plot(basin.pca,cex=0.5,col=c(rep(1,180),rep(2,90),rep(3,91)),ax1=1,ax2=2)

And we can see that in fact the three basins do group separately, although the St. Joe and St. Regis overlap to a greater extent than does Lochsa with either of the others.

What is plotted is, of course, the first two PCs, the ones that explain the greatest share of the variance. In fact, given the multidimensionality of the data, there are many more PCs we could look at. What we need to ask is how much of the variance we are explaining. We can plot:

barplot(100*basin.pca$ipca$eig/sum(basin.pca$ipca$eig),xlab="Principal Component",ylab="Percent Variance Explained")

In this case, taken together, the first two PCs explain only about 17% of the variation - not an ideal situation. We could explore this further, but for now we’ll take another approach - use the same PCA results but color the points by catchment:

catch.freq <-table(IGS$Catchment)


test <-sapply(1:8, function(x) rep (x,catch.freq[x]))
col.catchment <-unlist(test)
plot(basin.pca,cex=0.5,col=col.catchment,ax1=1,ax2=2)

So we see that there is some degree of clustering by catchment; however, the major division remains that between basins on axis 1. Could we learn more by looking at additional components? Remember that we just saw that they do explain appreciable amounts of variation. Let’s keep axis 1 in our comparison but now plot it against axis 3:

plot(basin.pca,cex=0.5,col=col.catchment,ax1=1,ax2=3)

This is a bit more informative - certainly the data suggest further exploration is in order.

Another example - Human population structure

Remember our exploration of human genetic variation via Fst analysis - based on 1000 random SNPs in ALFRED, we found a mean value of 0.137, based on partitioning variation among continents. What if we were to apply PCA to similar data? In this case, ALFRED won’t work, since it only reports frequencies, not individual genotypes. Instead, we will turn to a sample from the 1000 genomes project:

data("SampledSNPs.gen") #In TeachingPopGen
## Warning in data("SampledSNPs.gen"): data set 'SampledSNPs.gen' not found
data("poplist")
dat.gen
## /// GENIND OBJECT /////////
## 
##  // 2,504 individuals; 179 loci; 354 alleles; size: 3.6 Mb
## 
##  // Basic content
##    @tab:  2504 x 354 matrix of allele counts
##    @loc.n.all: number of alleles per locus (range: 1-2)
##    @loc.fac: locus factor for the 354 columns of @tab
##    @all.names: list of allele names for each locus
##    @ploidy: ploidy of each individual  (range: 2-2)
##    @type:  codom
##    @call: df2genind(X = as.matrix(x[, attr(x, "locicol")]), sep = "[/\\|]", 
##     pop = pop, NA.char = ".")
## 
##  // Optional content
##    @pop: population of each individual (group size range: 347-661)

This is a random sampling of loci, with MAF > 0.05, from all of the autosomes. We could use a larger dataset, but it would tax our poor laptops, and the results would be similar. So remember our 1000 Genomes populations - 2504 genomes, 26 populations from five “superpopulations”. What do they look like in PCA?

dat.pca <-indpca(dat.gen)

And once that runs, we can plot it

plot(dat.pca,cex=.3,col=as.numeric(poplist.1000g$Superpopulation.code),eigen = FALSE,ax1=1,ax2=2)
leg.txt <-c("African","North American","East Asian","European","South Asian")
legend(5,-7.0,leg.txt,cex=.6, fill=1:5,bty="n")
title("Human Genetic Diversity",cex.main=.8,col.main="darkred")

And we see that, with the possible exception of the North American population, the first two PCs discriminate among superpopulations quite well (and axis 3 does provide further discrimination with respect to North America).

But does this mean that the eugenicists were right, and that race and genotype are inextricably linked? We can address this in two ways. First, we can look at the variance explained by the PCs, just as we did for the salamanders:

barplot(100*dat.pca$ipca$eig/sum(dat.pca$ipca$eig),xlab="Principal Component",ylab="Percent Variance Explained")

And we see that the first two components explain about 13% of the total variance - in other words, there is a great deal more genetic similarity among populations than there are differences between them. Second, we can do a nested F statistic analysis of these data, based on differentiation among populations within superpopulations, and after a while we get the following:

dat.hf <-genind2hierfstat(dat.gen)
levels <-data.frame(poplist.1000g[,3:2])

f.human <-varcomp.glob(levels,dat.hf[,-1])
f.human$F
##                      Superpopulation.code Population.code         Ind
## Total                          0.09308305     0.100469540 0.103308639
## Superpopulation.code           0.00000000     0.008144614 0.011275109
## Population.code                0.00000000     0.000000000 0.003156201

So the two kinds of analyses lead to similar results.

Conclusions

PCA is often a great way to start doing this sort of analysis, and you will encounter it in the literature frequently. Here we used genotypes as input; we could also use a distance matrix composed, for example, of pairwise Fst measures. But remember - those are based on our assumptions about population structure - what we have done here is not. But even so, the output we have is of rather limited informativeness. To refine our analysis further, we need a much better way of assigning genotypes to groups, and then using the results to construct meaningful biological hypotheses.